Title: HDIP_DViz_CA1

Module: Data Visualization

Assignment Brief

- I am an employee of a retail company, and it is my job to analyse the dataset "board_games.csv", which contains data on board games. This analysis will help determine the company's sales strategy for the upcoming Winter season.
- All questions will be answered using separate visualizations that will uncover helpful insights.
- All decisions in this report will be rationalized in order to evidence to the Chief Technology Officer (CTO) why a certain decision was made or a certain outcome was reached.
- The rationalization covers the visualization design decisions, how the data was engineered, how features were selected, and any other relevant information.

Imports

Importing the required libraries that allow the data to be read into Python. These libraries provide the tools used to manipulate and analyse the data.
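The import cell itself is not reproduced here, but a typical set of imports for this kind of analysis (pandas, NumPy, matplotlib and seaborn, matching the tools referenced later in the report) would look like:

```python
# Core libraries for data manipulation and visualization
import pandas as pd               # dataframes and CSV loading
import numpy as np                # numerical operations
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical visualizations built on matplotlib

# Show all dataframe columns when inspecting the data in the notebook
pd.set_option("display.max_columns", None)
```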

Helper Functions

Functions used to change the screen sizing of the notebook and enhance plots.

Loading the Data

Reading the dataset in from where it is stored on my PC and creating a dataframe called raw_df to hold it. The data is housed in a .csv (comma-separated values) file.

raw_df.head() shows the first 5 rows; this can be increased by passing a number inside the brackets.

raw_df.tail() shows the last 5 rows; this can be increased by passing a number inside the brackets.
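Since board_games.csv lives on my machine, a minimal, self-contained sketch of the loading step is shown below, with a small inline CSV standing in for the real file:

```python
import io
import pandas as pd

# Illustrative stand-in for board_games.csv; the real call would be
# pd.read_csv("board_games.csv") with the path to the local file
csv_text = """game_id,name,average_rating
1,Catan,7.2
2,Pandemic,7.6
3,Azul,7.8
4,Wingspan,8.1
5,Root,8.0
6,Gloomhaven,8.8
"""
raw_df = pd.read_csv(io.StringIO(csv_text))

print(raw_df.head())    # first 5 rows; raw_df.head(10) would show 10
print(raw_df.tail(2))   # last 2 rows
```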

Data Description

Data Dictionary

A data dictionary is a practical depiction of the metadata connected with a data object, such as a database, table, or column (Octopai, 2019). The data dictionary helps explain what each feature means and is very useful at the start of any data analysis project, both for the analyst and for anyone reading the data at a later date.

| Feature | Description |
| --- | --- |
| game_id | Unique game identifier |
| description | A paragraph of text describing the game |
| image | URL image of the game |
| max_players | Maximum recommended players |
| max_playtime | Maximum recommended playtime (min) |
| min_age | Minimum recommended age |
| min_players | Minimum recommended players |
| min_playtime | Minimum recommended playtime (min) |
| name | Name of the game |
| playing_time | Average playtime |
| thumbnail | URL thumbnail of the game |
| year_published | Year the game was published |
| artist | Artist for game art |
| category | Categories for the game (separated by commas) |
| compilation | If part of a compilation, the name of the compilation |
| designer | Game designer |
| expansion | If there is an expansion pack, the name of the expansion |
| family | Family of game, roughly equivalent to publisher |
| mechanic | Game mechanic (how the game is played), separated by commas |
| publisher | Company/person who published the game, separated by commas |
| average_rating | Average rating on BoardGameGeek (1-10) |
| users_rated | Number of users that rated the game |

Renaming of Columns

In my view, there is no requirement to rename the columns, as they are already in an appropriate format: all the feature names are lower case and in snake case.

Shape of Data

raw_df.shape reveals the dimensions of the dataframe, i.e. the number of observations (rows) and features (columns) the dataset possesses.

Data Types

raw_df.dtypes shows what the data types are for each variable/feature in the dataset.

Data Makeup

raw_df.info() reveals important insights about the dataset, such as the data type of each variable, the non-null count for each variable, the number of columns/variables, and the memory usage the dataframe places on the machine.

Null Values Check

raw_df.isnull().sum() helps us to investigate if there are any null values present.

Handling Null Values

The following piece of code drops the rows containing null values, an important step that will help the analysis down the line.
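A sketch of that step, assuming rows with any null value were dropped with dropna() (a toy frame stands in for raw_df):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values, standing in for raw_df
raw_df = pd.DataFrame({
    "name": ["Catan", "Pandemic", None, "Azul"],
    "average_rating": [7.2, 7.6, 6.9, np.nan],
})

# Drop every row containing at least one null value
raw_df = raw_df.dropna()

print(raw_df.isnull().sum())  # all zeros after the drop
```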

Checking if Null Values were dropped

raw_df.isnull().sum() confirms whether all the null values have been dropped or not.

Duplicates Check

Checking for duplicated data in the next lines of code.

Analyst's Strategy

I am removing certain features for a few reasons: some variables have a large number of null values, while others offer no value given the project brief. As an analyst, one must make decisive choices to focus the approach and limit the resources used to uncover the key information. In my opinion, having examined the project brief and its objectives, there is no requirement to keep the following variables: "image", "thumbnail", "compilation", "expansion", "artist", "family", "designer".

The reasons behind this are as follows:

Dropping columns

New dataframe

Here I am making a copy of the raw data and I am doing so by creating a new dataframe to house this.
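A sketch of the copy-and-drop step, using a toy frame and a shortened drop list as placeholders:

```python
import pandas as pd

# Minimal stand-in for the raw data
raw_df = pd.DataFrame({
    "name": ["Catan"],
    "image": ["http://example.com/catan.png"],
    "thumbnail": ["http://example.com/catan_t.png"],
    "average_rating": [7.2],
})

# Work on a copy so the raw data stays untouched
df = raw_df.copy()

# Drop the features identified as adding no value to the brief
# (full list also includes compilation, expansion, artist, family, designer)
cols_to_drop = ["image", "thumbnail"]
df = df.drop(columns=cols_to_drop)
```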

Descriptive Statistics

Descriptive statistics are used next; this will help me to better understand the data. I will start by splitting the data into numerical and categorical attributes. Numerical features are those whose type is 'int64' or 'float64'. Categorical features are all types other than 'int64', 'float64', and 'datetime64'.

The study of both numerical and categorical attributes provides a short analysis of the features. This will help me to understand the scope of the data and prepare for future steps, such as Exploratory Data Analysis (EDA).

Splitting the data into numerical and categorical features
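The split described above can be sketched with select_dtypes (note pandas expects the precise 'datetime64[ns]' string when excluding datetimes):

```python
import pandas as pd

# Toy frame with one categorical and two numerical columns
df = pd.DataFrame({
    "name": ["Catan", "Azul"],
    "year_published": [1995, 2017],
    "average_rating": [7.2, 7.8],
})

# Numerical attributes: int64 and float64 columns
num_attributes = df.select_dtypes(include=["int64", "float64"])

# Categorical attributes: everything else
cat_attributes = df.select_dtypes(exclude=["int64", "float64", "datetime64[ns]"])
```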

Numerical Attributes

Here I am using .T to transpose the summary so each attribute appears as a row. Measures of central tendency, such as the mean and median, will be looked at.

The below code reveals the dispersion of the data: standard deviation (std), minimum (min), maximum (max), range, skew and kurtosis.

The last line of code uses pd.concat, which merges the dataframes together and presents them in a single tidy table.
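A minimal sketch of building that summary table with pd.concat (the toy values are illustrative):

```python
import pandas as pd

num_attributes = pd.DataFrame({
    "average_rating": [6.0, 7.0, 8.0, 9.0],
    "max_playtime": [30, 60, 90, 240],
})

# Central tendency, each transposed to a one-row frame with .T
ct_mean = pd.DataFrame(num_attributes.mean()).T
ct_median = pd.DataFrame(num_attributes.median()).T

# Dispersion
d_std = pd.DataFrame(num_attributes.std()).T
d_min = pd.DataFrame(num_attributes.min()).T
d_max = pd.DataFrame(num_attributes.max()).T
d_range = pd.DataFrame(num_attributes.max() - num_attributes.min()).T
d_skew = pd.DataFrame(num_attributes.skew()).T
d_kurtosis = pd.DataFrame(num_attributes.kurtosis()).T

# Merge the pieces into one summary table, one row per attribute
summary = pd.concat([d_min, d_max, d_range, ct_mean, ct_median,
                     d_std, d_skew, d_kurtosis]).T.reset_index()
summary.columns = ["attribute", "min", "max", "range", "mean",
                   "median", "std", "skew", "kurtosis"]
```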

Categorical Attributes

Exploratory Data Analysis

Distribution of numerical data

Similar to the pairwise plot, using a histogram to show how the data is distributed is a very useful insight to gain.

Data Relationships

A basic correlogram is formed using the sns.pairplot(df_1) code below; the data is plotted using a scatter plot for each pair of variables/features. This shows the overall distribution and shape of each variable individually and the relationship each has with the others. It is a high-level overview of the distribution of the data one is working with.

Correlation Matrix

A correlation matrix is very useful for revealing relationships among the variables. The following code sets the range of values displayed on the colormap from -1 to 1. The second line gives the heatmap a title; the pad argument defines the distance of the title from the top of the heatmap.

Text Analysis - High Level

Here I installed wordcloud, a Python package that generates a word cloud showing the most common words within the text columns of the dataframe. To install it I used: !pip install wordcloud. Uncovering such information is not highly informative on its own, but it does give an inkling as to which words are used most often and are most associated with board games.

Wordcloud of the variable "mechanic" and its text
Wordcloud of the variable "category" and its text
Wordcloud of the variable "description" and its text

Feature Engineering

Feature engineering is the initial stage in a machine learning pipeline and comprises all the techniques utilised to clean existing datasets, improve their signal-to-noise ratio, and reduce their dimensionality (Bonaccorso, 2017).

Outlier Detection and Handling

Firstly, a list is created of the columns containing numeric variables; this assists the whisker plot generation.

Whisker Plots

Using whisker (box) plots for all columns containing numeric variables to detect any outliers.

Scaling of the Data

Min Max Scaler

Rescaling is a common pre-processing task in machine learning where it places all features on the same scale, typically 0 to 1 or –1 to 1. There are a number of rescaling techniques, but one of the simplest is min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range.

The Robust Scaler was used to reduce the influence of outliers, as it scales using the median and interquartile range rather than the minimum and maximum.

Robust Scaler
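As a sketch of the two techniques: the notebook presumably used scikit-learn's MinMaxScaler and RobustScaler, but the underlying formulas can be applied directly with pandas to show what each scaler computes:

```python
import pandas as pd

# One numeric column with a single extreme outlier (1000)
df = pd.DataFrame({"max_playtime": [10.0, 30.0, 60.0, 120.0, 1000.0]})
col = df["max_playtime"]

# Min-max scaling: (x - min) / (max - min), mapping values into [0, 1].
# Sensitive to outliers: the extreme value compresses everything else.
minmax_scaled = (col - col.min()) / (col.max() - col.min())

# Robust scaling: (x - median) / IQR, built from statistics that
# outliers barely move, which is why it suits skewed playtimes.
iqr = col.quantile(0.75) - col.quantile(0.25)
robust_scaled = (col - col.median()) / iqr
```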

Assignment Brief

Part 1

1.A) What are the top 5 “average rated” games?

The below code produces the top 5 "average rated" games and keeps them in a new dataframe called df_top5.
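A sketch of that selection using nlargest. The game names match the result reported in this section, but the ratings shown are illustrative placeholders, not the dataset's actual values:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Small World Designer Edition", "Kingdom Death: Monster",
             "Terra Mystica: Big Box", "Last Chance for Victory",
             "The Greatest Day: Sword, Juno and Gold Beaches", "Catan"],
    # Placeholder ratings for illustration only
    "average_rating": [8.99, 8.93, 8.85, 8.85, 8.84, 7.2],
})

# Keep the five highest-rated games in a new dataframe
df_top5 = df.nlargest(5, "average_rating")[["name", "average_rating"]]
```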

Next I reveal the top 5 average rated board games by placing the results from the above line of code into a table.

The top 5 average rated games are as follows; 1) Small World Designer Edition, 2) Kingdom Death: Monster, 3) Terra Mystica: Big Box, 4) Last Chance for Victory and 5) The Greatest Day: Sword, Juno and Gold Beaches

The following code produces a barplot that visually shows the top 5 average-rated games, all of which are very close to each other, as seen in the above table.

The barplot is a very simple and easy graph to read. The colours chosen are bright and vibrant and display the differences between games nicely. The font colour and size are kept to the standard, as it is best to stop them from distracting from the overall visual and encroaching on the bars.

I decided to search for the top three average-rated games and have uploaded an image of the actual board each game is played on, along with its pieces, to provide an insight into what the board games look like.

No1 Average Rated Game - "Small World Designer"

No2 Average Rated Game - "Kingdom Death Monster"

No3 Average Rated Game - "Terra Mystica Big Box"

1.B) Is there a correlation between the variables “users_rated” and the “max_playtime”?

Pearson's coefficient: this will provide an initial basis to work from and will help determine whether a correlation is evident.

The correlation coefficient, or Pearson's r, is a measure of both the strength and direction of the linear relationship between two variables. The statistic is a number between -1 and 1, where -1 is total negative correlation and +1 is total positive correlation (Lesmeister, 2015). Therefore the result above of -0.009 shows there is no real relationship between the variables users_rated and max_playtime.

The above output is broken down as follows (Vallat, 2019):

- n: the sample size, i.e. the number of observations used
- r: the correlation coefficient
- r2 & adj_r2: the coefficient of determination and its adjusted version
- p-val: the p-value of the test
- BF10: the Bayes Factor of the alternative hypothesis
- power: the achieved statistical power of the test
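The headline statistic can be reproduced at small scale with pandas' built-in Pearson correlation. Synthetic, independently generated columns are used here, so r comes out near zero, mirroring the report's finding:

```python
import numpy as np
import pandas as pd

# Two independently generated columns standing in for the real data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "users_rated": rng.integers(1, 5000, size=500),
    "max_playtime": rng.integers(10, 300, size=500),
})

# Pearson's r between the two variables; independence implies r near 0
r = df["users_rated"].corr(df["max_playtime"], method="pearson")
```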

Plotting users_rated & max_playtime to see whether we can visually uncover a relationship in the data.

I chose a scatter plot here as it is very readable. The data is graphed very simply and the (lack of a) relationship is revealed with minimal fuss. The colours, font size and style are all kept to the standard.

Correlation Analysis between 1) users_rated and 2) max_playtime conclusion

Correlation analysis between two variables, 1) users_rated and 2) max_playtime.

Causation explicitly applies to cases where action A causes outcome B. On the other hand, correlation is simply a relationship: action A relates to action B, but one event doesn't necessarily cause the other to happen (Madhavan, 2019). Failing to confirm causality can lead to a false positive result, where you believe an underlying relationship exists when in fact it isn't there. This can be very problematic.

The results obtained from our correlation analysis make it clear there is no correlation between the two variables: the number of users that rate a game has no effect on its maximum playing time. That said, correlation analysis alone cannot rule out every form of relationship, so further tests would have to be carried out to confirm that this is the case.

1.C) What is the distribution of game categories?

1.D) Do older games (1992 and earlier) have a higher MEAN “average rating” than newer games (after 1992)?

Here I am going to use the .loc function, firstly to gather the data from 1992 and earlier, and secondly to gather the data from 1993 onwards.
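The .loc split and mean comparison can be sketched as follows (toy values, chosen only to illustrate the mechanics):

```python
import pandas as pd

df = pd.DataFrame({
    "year_published": [1985, 1990, 1992, 1995, 2005, 2016],
    "average_rating": [6.0, 6.2, 6.1, 6.8, 7.1, 7.5],
})

# Older games: published in 1992 or earlier
older = df.loc[df["year_published"] <= 1992]

# Newer games: published from 1993 onwards
newer = df.loc[df["year_published"] > 1992]

older_mean = older["average_rating"].mean()
newer_mean = newer["average_rating"].mean()
```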

Result of Study

The mean is higher for the newer games, i.e. games from 1993 to 2016.

The below graphs will visually show this.

Visual representation of mean from 1950 to 2016

Visual representation of the mean from 1950 to 1992

Visual representation of the mean from 1993 to 2016

1.E) What are the 3 most common “mechanics” in the dataset?
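One way to answer this (an assumption, since the notebook code is not shown here) is to split the comma-separated mechanic strings, explode them into one row per mechanic, and count:

```python
import pandas as pd

# Toy stand-in for the mechanic column
df = pd.DataFrame({
    "mechanic": [
        "Dice Rolling,Hand Management",
        "Hand Management,Set Collection",
        "Dice Rolling,Hand Management,Area Control",
        "Dice Rolling",
    ],
})

# Split each comma-separated string, explode to one mechanic per row,
# then count occurrences and keep the three most common
top3 = (df["mechanic"]
        .str.split(",")
        .explode()
        .value_counts()
        .head(3))
```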

Part 2

2. Statistical Relevant Question - Research and Answer

In this section I will carry out independent research in an area of my interest. The aim is to provide valuable insights to the company and help it with its sales strategy for the upcoming Winter season. The area I have selected to focus on is which target market is best to aim board games at.

Thinking about Winter, lots of things come to mind, such as the weather, but ultimately Christmas is the pinnacle of every Winter season, bringing with it the purchase of toys and the gifting of presents. Furthermore, Santa comes to mind for children aged 12 and under.

In my analysis I have decided to target the children's market for selling board games. I took the key variables I felt were important and housed them in a new dataframe, then carried out the steps below to help uncover an insight. I will provide a high-level synopsis of the results at the end of these steps.

New Dataframe

Below I singled out the variables I wished to focus on, this helps my graphs and keeps things neat and tidy.

Checking the data with new_df.info() to make sure everything is okay after the previous data cleaning and feature engineering steps.

Examining Minimum Playtime Graphically

In the below graph it is clear there is an outlier in the data, circa 1980. Other than that, the insight gained from this graph is that the minimum playing time of board games has slowly decreased over the years and is still decreasing in the latest year of the data, 2016.

Examining Minimum Players Graphically

The minimum number of players is also decreasing over time.

Examining Minimum Age Graphically

The minimum age has gone up and down over the years and is on the rise according to the latest year of the data, which is 2016.

A relplot was used, which is very precise and represents the data nicely.

Narrowing the Focus

I wanted to obtain the board games for children under the age of 13, which is why I used this code.

Ascertaining the board games that require fewer than 3 players.

Performing descriptive statistics on the data in the new dataframe

Gathering the data where the minimum playing time of board games is less than or equal to 30 minutes.

Next I am selecting board games published after the year 1999.
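The narrowing steps above can be combined into one sketch (toy data; the thresholds match the ones described in this section):

```python
import pandas as pd

new_df = pd.DataFrame({
    "name": ["Kids Quick", "Long Epic", "Family Fun", "Old Classic"],
    "min_age": [6, 14, 8, 8],
    "min_players": [2, 4, 2, 2],
    "min_playtime": [15, 120, 30, 20],
    "year_published": [2010, 2012, 2015, 1995],
    "users_rated": [500, 900, 1200, 300],
})

# Apply each narrowing step from the analysis in turn
child_df = new_df[new_df["min_age"] < 13]                # children under 13
child_df = child_df[child_df["min_players"] < 3]         # fewer than 3 players
child_df = child_df[child_df["min_playtime"] <= 30]      # at most 30 minutes
child_df = child_df[child_df["year_published"] > 1999]   # after 1999

# Rank the remaining games by how many users rated them
child_df = child_df.sort_values("users_rated", ascending=False)
```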

The barplot was used to keep the results clean and show exactly which game was best rated and the one that I would definitely advise be stocked in the shops for the Winter.

Overall Findings

One of the most common questions about board games is: "How long does the game take to play?" (The Board Game Family, 2021). According to this dedicated webpage, 30 minutes represents the ideal amount of time to play a board game. Nowadays, our attention spans are shorter, we are always on the go, and we haven't much time to spend, so playing a board game in a short space of time is the ideal scenario.

This drove my research, as parents buying board games for their kids understand that they will end up playing the game with them. That is why I chose to identify the games that take less than 30 minutes to play and can be played with two players or fewer, and then ranked them by the number of users who rated them. The fact that many users rate these games indicates to me that they are quite popular and well known, and so will not take much money to advertise. In essence, that is my strategy for the sales department, and the top 20 games I would stock in the shops are shown above.

Part 3

Throughout the code and markdown is where I have addressed the questions to Part 3.

References

Bonaccorso, G., 2017. Machine Learning Algorithms. Birmingham: Packt Publishing.

Lesmeister, C., 2015. Mastering Machine Learning with R. Birmingham: Packt.

Madhavan, A., 2019. Amplitude. [Online] Available at: https://amplitude.com/blog/causation-correlation [Accessed 20th October 2021].

Octopai, 2019. Octopai. [Online] Available at: https://www.octopai.com/automatic-data-dictionary-mapping-using-machine-learning/ [Accessed 27th October 2021].

Vallat, R., 2019. raphaelvallat.com. [Online] Available at: https://raphaelvallat.com/correlation.html [Accessed 28th October 2021].

The Board Game Family, 2021. The Board Game Family. [Online] Available at: https://www.theboardgamefamily.com/2012/01/the-best-game-length/ [Accessed 29th October 2021].